Improve search: multi-term AND + relevance ranking (FTS spike) by rdhyee · Pull Request #95 · isamplesorg/isamplesorg.github.io

rdhyee · 2026-04-09T00:20:11Z

Summary

Closes #84 — FTS spike complete with immediate search improvements and documented future path.

Shipped now (zero new dependencies):

Multi-term search: "pottery Cyprus" requires BOTH words to match (was OR on the full phrase)
Relevance ranking: results sorted by score when searching — label match = 3pts, place = 2pts, description = 1pt
When not searching, results remain random for exploration variety
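
As an illustration of the behaviour described above, here is a minimal sketch of how a multi-term AND filter and a weighted score expression could be assembled. It is not the code in this PR: the column names (`label`, `place_name`, `description`) and the helper names are assumptions, and wildcard escaping is left out to keep the sketch short.

```python
# Hypothetical column names; the PR names the fields and weights
# (label = 3, place = 2, description = 1) but not the parquet columns.
FIELD_WEIGHTS = {"label": 3, "place_name": 2, "description": 1}


def sql_literal(text: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_search_sql(query: str) -> tuple[str, str]:
    """Return (where_clause, score_expr) for a whitespace-separated query."""
    terms = query.split()
    if not terms:
        # No search terms: no filtering and no ranking signal.
        return "TRUE", "0"
    where_parts = []
    score_parts = []
    for term in terms:
        pattern = sql_literal(f"%{term}%")
        # A term counts as matched if ANY field contains it ...
        any_field = " OR ".join(f"{col} ILIKE {pattern}" for col in FIELD_WEIGHTS)
        where_parts.append(f"({any_field})")
        # ... and each matching field adds its weight to the score.
        for col, weight in FIELD_WEIGHTS.items():
            score_parts.append(
                f"CASE WHEN {col} ILIKE {pattern} THEN {weight} ELSE 0 END"
            )
    # ALL terms must match, so the per-term groups are joined with AND.
    return " AND ".join(where_parts), " + ".join(score_parts)
```

With `build_search_sql("pottery Cyprus")`, the filter contains two parenthesised groups joined by `AND`, so a row has to match both words somewhere, while the score sums the weights of every field each word hits.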

FTS spike findings:

Built offline DuckDB FTS index with tools/build_fts_index.py
Full index (label + description + place_name): 358 MB — too large for auto-download
Lite index (label + place_name only): 211 MB — still substantial
BM25 scoring works well (Porter stemming, English stopwords)
ATTACH over HTTP in DuckDB-WASM is supported but downloading 200-358 MB is impractical
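
For readers unfamiliar with the DuckDB FTS extension, the sketch below generates the kind of statements an offline index build and a BM25 query involve. It does not reproduce tools/build_fts_index.py; the table name `samples`, the id column `pid`, and the helper name are assumptions chosen for illustration.

```python
def fts_statements(table: str, id_col: str, text_cols: list[str], query: str) -> list[str]:
    """Return DuckDB statements that build an FTS index and run a BM25 query.

    The table and column names are supplied by the caller; nothing here
    reflects the actual schema used in the spike.
    """
    cols = ", ".join(f"'{c}'" for c in text_cols)
    safe_query = query.replace("'", "''")  # double single quotes for SQL
    return [
        "INSTALL fts;",
        "LOAD fts;",
        # Porter stemming and English stopwords, as reported in the findings.
        f"PRAGMA create_fts_index('{table}', '{id_col}', {cols}, "
        "stemmer = 'porter', stopwords = 'english', overwrite = 1);",
        # match_bm25 returns NULL for rows that do not match the query.
        f"SELECT {id_col}, score FROM ("
        f"SELECT *, fts_main_{table}.match_bm25({id_col}, '{safe_query}') AS score "
        f"FROM {table}) AS ranked "
        "WHERE score IS NOT NULL ORDER BY score DESC LIMIT 20;",
    ]


# Hypothetical "lite" index over label and place_name only.
for statement in fts_statements("samples", "pid", ["label", "place_name"], "pottery cyprus"):
    print(statement)
```

Dropping `description` from `text_cols` corresponds to the lite variant measured above; the index size still depends on the data, so the sketch says nothing about whether a smaller column set would be small enough to ship.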

Recommended next steps (not in this PR):

Explore pre-tokenized search parquet (inverted index as parquet, much smaller)
Consider on-demand FTS loading behind an "Enhanced Search" toggle
Evaluate DuckDB text analytics functions (stemming without full index)

Test plan

Search "pottery" → results ranked by relevance (label matches first)
Search "pottery Cyprus" → only samples matching BOTH words
Search "basalt" → geological samples with label matches at top
Clear search → results return to random sampling
Verify tools/build_fts_index.py runs successfully with local parquet

🤖 Generated with Claude Code

Search improvements (immediate):
- Multi-term search: "pottery Cyprus" requires BOTH words to match
- Relevance ranking: label matches weighted 3x, place 2x, description 1x
- Results sorted by relevance score when searching (random for browsing)

FTS spike (future path, documented):
- Added tools/build_fts_index.py to build DuckDB FTS index offline
- Tested: 358 MB full index, 211 MB lite; both too large for auto-download
- BM25 scoring works correctly (Porter stemming, stopwords)
- Next step: explore smaller index strategies or on-demand loading

Closes isamplesorg#84 (spike complete; findings documented in PR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Search input was passed into ILIKE patterns with only single-quote escaping, so a literal "%" or "_" in the query (e.g. "100%", "co_op") silently turned into wildcards. Escape % _ \ and add ESCAPE '\' in both whereClause and the relevance-score expression.

Also reframe tools/build_fts_index.py as a spike artifact: the docstring told readers to upload the index to data.isamples.org, but per PR isamplesorg#95 findings the 200-358 MB result is too large to ship. Mark the script NOT in production pipeline and drop the misleading upload instructions.

Smoke-tested locally with /tmp/explorer_smoke_test.py (multi-term "pottery cyprus" + wildcard "100%"): 0 JS exceptions, 0 console errors, 0 failed requests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee · 2026-04-28T20:46:26Z

Reviewed and pushed two small follow-ups (134aca2):

1. ILIKE wildcard escaping. Search input was passed into the ILIKE pattern with only single-quote escaping, so literal % or _ in the query (e.g. 100%, co_op) silently became wildcards. Now escape % _ \ and add ESCAPE '\' in both the whereClause block and the relevance-score expression.

2. FTS spike script header. tools/build_fts_index.py told readers to "upload to data.isamples.org" but per the PR's own findings the 200-358 MB result is too large to ship. Reframed as STATUS: spike artifact — NOT in production pipeline, kept the script for future revisits, dropped the misleading upload instructions.
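
The escaping described in item 1 can be sketched as follows. This is an illustration in Python rather than the page's own code, and the function names are hypothetical; the one detail that matters is that the backslash is escaped before `%` and `_`, so the escapes added for those characters are not themselves doubled.

```python
def escape_like(term: str) -> str:
    """Escape LIKE/ILIKE metacharacters so they match literally.

    The backslash is handled first; otherwise the backslashes added
    for % and _ would be escaped a second time.
    """
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def ilike_predicate(column: str, term: str) -> str:
    """Build `column ILIKE '%term%' ESCAPE '\\'` with the term made literal."""
    pattern = "%" + escape_like(term) + "%"
    literal = pattern.replace("'", "''")  # single-quote escaping for the SQL literal
    return f"{column} ILIKE '{literal}' ESCAPE '\\'"


print(ilike_predicate("label", "100%"))
print(ilike_predicate("label", "co_op"))
```

Without the `ESCAPE` clause the added backslashes would be treated as ordinary characters in the pattern, which is why the clause and the escaping need to go in together.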

Smoke test (/tmp/explorer_smoke_test.py against local Quarto render):

Serving docs on :64856
URL: http://127.0.0.1:64856/tutorials/isamples_explorer.html
JS exceptions:    0
Console errors:   0
Failed requests:  0
RESULT: PASS

Exercised: initial load, multi-term search (pottery cyprus), wildcard-char search (100%). Screenshot confirms the new placeholder and that 100% no longer matches everything.

Other notes from review (not blocking):

Score expression has discrete plateaus (integer totals from 0 to 6 per term, since the 3/2/1 field weights add up); ties break alphabetically on label. Fine for a spike; could mention in placeholder docs later.
description ILIKE over the wide parquet over HTTP range-fetch may add first-search latency; worth a ?perf=1 measurement before declaring search "done", but out of scope here.
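
To make the plateau and tie-break notes concrete, the sketch below enumerates the per-term totals that are reachable and applies the described ordering. It assumes the score is the sum of the weights of the fields a term matches; the names are illustrative only.

```python
from itertools import combinations

# Weights quoted in the PR summary; "place" stands in for the place field.
WEIGHTS = {"label": 3, "place": 2, "description": 1}


def reachable_scores(weights: dict[str, int]) -> list[int]:
    """All per-term totals obtainable by summing the weights of matched fields."""
    fields = list(weights)
    totals = set()
    for size in range(len(fields) + 1):
        for matched in combinations(fields, size):
            totals.add(sum(weights[f] for f in matched))
    return sorted(totals)


def rank(rows: list[dict]) -> list[dict]:
    """Sort by score descending, then by label ascending to break ties."""
    return sorted(rows, key=lambda r: (-r["score"], r["label"]))


print(reachable_scores(WEIGHTS))
print(rank([
    {"label": "pottery sherd", "score": 3},
    {"label": "basalt core", "score": 3},
    {"label": "amphora", "score": 5},
]))
```

Because so few distinct totals exist, many results share a score, and the alphabetical tie-break decides most of the visible ordering within each plateau.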

LGTM to merge once you've eyeballed the diff.

rdhyee mentioned this pull request Apr 9, 2026

iSamples MVP Cleanup & Simplification Strategy #49

Closed

6 tasks

rdhyee wants to merge 2 commits into isamplesorg:main from rdhyee:feature/fts-spike
